
    An Efficient OpenMP Runtime System for Hierarchical Architectures

    Exploiting the full computational power of ever-deeper hierarchical multiprocessor machines requires a very careful distribution of threads and data over the underlying non-uniform architecture. The emergence of multi-core chips and NUMA machines makes it important to minimize the number of remote memory accesses, to favor cache affinities, and to guarantee fast completion of synchronization steps. By using the BubbleSched platform as a threading backend for the GOMP OpenMP compiler, we are able to easily transpose affinities of thread teams into scheduling hints, using abstractions called bubbles. We then propose a scheduling strategy suited to nested OpenMP parallelism. Preliminary performance evaluations show a significant speedup improvement on a typical NAS OpenMP benchmark application.
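
    As a rough illustration of the workload this strategy targets, the sketch below spawns nested OpenMP teams, the structure that bubbles capture as scheduling hints. It uses only standard OpenMP directives, not the BubbleSched or GOMP internals, and the team sizes are arbitrary assumptions.

        /* Nested OpenMP parallelism: an outer team (e.g. one thread per
         * NUMA node), each member spawning an inner team (e.g. the cores
         * of that node). Bubble scheduling keeps inner teams together. */
        #include <omp.h>
        #include <stdio.h>

        int main(void)
        {
            omp_set_nested(1);              /* allow inner parallel regions */

            #pragma omp parallel num_threads(2)        /* outer team */
            {
                int outer = omp_get_thread_num();
                #pragma omp parallel num_threads(4)    /* inner team */
                printf("team %d, thread %d\n", outer, omp_get_thread_num());
            }
            return 0;
        }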

    An Efficient and Transparent Thread Migration Scheme in the PM2 Runtime System

    This paper describes a new iso-address approach to the dynamic allocation of data in a multithreaded runtime system with thread migration capability. The system guarantees that migrated threads and their associated static data are relocated at exactly the same virtual address on the destination nodes, so that no post-migration processing is needed to keep pointers valid. In the experiments reported, a thread can be migrated in less than 75 ”s.
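
    The iso-address idea can be sketched with a plain mmap: if every node reserves the same virtual address range up front, data copied to the destination lands at an identical address and all pointers into it remain valid. The base address, area size, and flags below are illustrative assumptions, not PM2's actual allocator; MAP_FIXED_NOREPLACE is Linux-specific (4.17+).

        /* Reserve the same virtual range on every node, so migrated data
         * keeps its addresses. Constants are illustrative, not PM2's. */
        #define _GNU_SOURCE             /* for MAP_FIXED_NOREPLACE (Linux) */
        #include <stdio.h>
        #include <sys/mman.h>

        #define ISO_BASE ((void *)0x200000000000ULL)  /* same on all nodes */
        #define ISO_SIZE (64UL << 20)                 /* 64 MB area        */

        int main(void)
        {
            /* MAP_FIXED_NOREPLACE fails rather than clobber existing maps. */
            void *area = mmap(ISO_BASE, ISO_SIZE, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
                              -1, 0);
            if (area != ISO_BASE) { perror("mmap"); return 1; }

            int *counter = (int *)area;   /* this pointer would stay valid */
            *counter = 42;                /* on any node mapping ISO_BASE  */
            printf("counter @ %p = %d\n", (void *)counter, *counter);
            return 0;
        }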

    EASYPAP: a Framework for Learning Parallel Programming

    This paper presents EASYPAP, an easy-to-use programming environment designed to help students learn parallel programming. EASYPAP features a wide range of 2D computation kernels that students are invited to parallelize using Pthreads, OpenMP, OpenCL or MPI. Execution of kernels can be interactively visualized, and powerful monitoring tools allow students to observe both the scheduling of computations and the assignment of 2D tiles to threads/processes. By focusing on algorithms and data distribution, students can experiment with diverse code variants and tune multiple parameters, resulting in richer problem exploration and faster progress towards efficient solutions. We present selected lab assignments which illustrate how EASYPAP improves the way students explore parallel programming.
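
    The sketch below shows the kind of tiled 2D kernel such assignments revolve around, parallelized with plain OpenMP. The image layout, tile size, and function names are illustrative assumptions, not EASYPAP's actual API.

        /* A toy per-pixel update over TILE x TILE tiles of a 2D image,
         * parallelized across tiles with OpenMP. */
        #include <omp.h>

        #define DIM  1024
        #define TILE 32

        static unsigned image[DIM][DIM];

        static void do_tile(int x, int y)       /* process one tile */
        {
            for (int i = y; i < y + TILE; i++)
                for (int j = x; j < x + TILE; j++)
                    image[i][j] = (image[i][j] + 1) & 0xff;
        }

        int main(void)
        {
            /* collapse(2) exposes all tiles as one parallel iteration space;
             * schedule(runtime) lets students compare scheduling policies. */
            #pragma omp parallel for collapse(2) schedule(runtime)
            for (int y = 0; y < DIM; y += TILE)
                for (int x = 0; x < DIM; x += TILE)
                    do_tile(x, y);
            return 0;
        }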

    Efficient shared memory message passing for inter-VM communications

    Thanks to recent advances in virtualization technologies, it is now possible to benefit from the flexibility brought by virtual machines at little cost in terms of CPU performance. However, on HPC clusters, some overheads remain which prevent widespread use of virtualization. In this article, we tackle the issue of inter-VM MPI communications when VMs are located on the same physical machine. To achieve this, we introduce a virtual device which provides a simple message-passing API to the guest OS. This interface can then be used to implement an efficient MPI library for virtual machines. The use of a virtual device makes our solution easily portable across multiple guest operating systems, since it only requires a small driver to be written for this device. We present an implementation based on Linux, the KVM hypervisor and Qemu as its userspace device emulator. Our implementation achieves near-native performance in terms of MPI latency and bandwidth.
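
    A user-space analogue of the mechanism can be sketched with POSIX shared memory: two processes map the same buffer and pass a message through it, with a flag to publish completion. The paper's design does this across VM boundaries through a virtual device; the segment name and channel layout below are illustrative assumptions.

        /* Two processes exchange a message through a shared mapping.
         * Run the receiver first, then "./demo send" from a second
         * process. Link with -lrt on older glibc. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        struct channel { volatile int ready; char payload[256]; };

        int main(int argc, char **argv)
        {
            int fd = shm_open("/intervm-demo", O_CREAT | O_RDWR, 0600);
            ftruncate(fd, sizeof(struct channel));
            struct channel *ch = mmap(NULL, sizeof *ch, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);

            if (argc > 1 && !strcmp(argv[1], "send")) {
                strcpy(ch->payload, "hello across the VM boundary");
                __sync_synchronize();      /* publish payload before flag  */
                ch->ready = 1;
            } else {
                while (!ch->ready)         /* spin until the sender is done */
                    ;
                printf("received: %s\n", ch->payload);
                shm_unlink("/intervm-demo");
            }
            munmap(ch, sizeof *ch);
            return 0;
        }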

    A unified runtime system for heterogeneous multicore architectures

    Approaching the theoretical performance of heterogeneous multicore architectures, equipped with specialized accelerators, is a challenging issue. Unlike regular CPUs, which can transparently access the whole global memory address range, accelerators usually embed local memory on which they perform all their computations using a specific instruction set. While many research efforts have been devoted to offloading parts of a program onto such coprocessors, the real challenge is to find a programming model providing a unified view of all available computing units. In this paper, we present an original runtime system providing a high-level, unified execution model allowing seamless execution of tasks over the underlying heterogeneous hardware. The runtime is based on a hierarchical memory management facility and on a codelet scheduler. We demonstrate the efficiency of our solution with an LU decomposition on both homogeneous (3.8 speedup on 4 cores) and heterogeneous machines (95% efficiency). We also show that "granularity-aware" scheduling can improve execution time by 35%.
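
    The codelet concept can be sketched as a task descriptor carrying several architecture-specific implementations, from which the runtime picks one at dispatch time. This mimics the idea only; the struct and scheduling rule below are illustrative assumptions, not the runtime's real API.

        /* A "codelet": one logical task, several implementations. */
        #include <stdio.h>

        typedef void (*kernel_fn)(void *buffers, void *arg);

        struct codelet {
            kernel_fn cpu_func;    /* version for a regular core */
            kernel_fn accel_func;  /* version for an accelerator */
        };

        static void scal_cpu(void *b, void *a)
        { (void)b; (void)a; printf("running on a CPU core\n"); }

        static void scal_accel(void *b, void *a)
        { (void)b; (void)a; printf("offloaded to an accelerator\n"); }

        /* Toy dispatch rule: use the accelerator version when one is idle. */
        static void submit(struct codelet *cl, void *buffers, void *arg,
                           int accelerator_idle)
        {
            (accelerator_idle && cl->accel_func ? cl->accel_func
                                                : cl->cpu_func)(buffers, arg);
        }

        int main(void)
        {
            struct codelet scal = { scal_cpu, scal_accel };
            submit(&scal, NULL, NULL, 1);   /* accelerator available */
            submit(&scal, NULL, NULL, 0);   /* fall back to the CPU  */
            return 0;
        }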

    SPAWN: An Iterative, Potentials-Based, Dynamic Scheduling and Partitioning Tool

    Many applications of physics modeling use regular meshes on which computations of highly variable cost can occur. Distributing the underlying cells over manycore architectures is a critical load-balancing step that should maximize the time until another such step is required. Graph partitioning tools are known to be very effective for such problems, but they exhibit scalability problems as the number of cores and the number of cells increase. We introduce a dynamic task scheduling approach inspired by physical particle interactions. Our method lets cores virtually move over a 2D/3D mesh of tasks and uses a Voronoi domain decomposition to balance workload among cores. Displacements of the cores are the result of force computations using a carefully chosen pair potential. We evaluate our method against graph partitioning tools and existing task schedulers on a representative physical application, and demonstrate the relevance of our approach.
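
    The mechanics can be illustrated with a toy relaxation: core "particles" move in 2D under a pair force, and each mesh cell is then assigned to its nearest core, a discrete Voronoi decomposition. The simple 1/d^2 repulsion, step size, and grid below are illustrative assumptions; SPAWN's actual potential is carefully chosen and accounts for workload.

        /* Cores repel each other inside the unit square, then each mesh
         * cell is assigned to its nearest core (discrete Voronoi). */
        #include <math.h>
        #include <stdio.h>

        #define CORES 4
        #define STEPS 1000
        #define GRID  16

        static double px[CORES] = {0.40, 0.50, 0.60, 0.50};
        static double py[CORES] = {0.50, 0.40, 0.50, 0.60};

        int main(void)
        {
            /* Relax core positions under a repulsive 1/d^2 pair force. */
            for (int s = 0; s < STEPS; s++)
                for (int i = 0; i < CORES; i++) {
                    double fx = 0, fy = 0;
                    for (int j = 0; j < CORES; j++) {
                        if (j == i) continue;
                        double dx = px[i] - px[j], dy = py[i] - py[j];
                        double d = sqrt(dx * dx + dy * dy) + 1e-9;
                        fx += dx / (d * d * d);
                        fy += dy / (d * d * d);
                    }
                    px[i] = fmin(fmax(px[i] + 1e-5 * fx, 0.0), 1.0);
                    py[i] = fmin(fmax(py[i] + 1e-5 * fy, 0.0), 1.0);
                }

            /* Voronoi decomposition: each cell goes to its nearest core. */
            int owned[CORES] = {0};
            for (int cy = 0; cy < GRID; cy++)
                for (int cx = 0; cx < GRID; cx++) {
                    double x = (cx + 0.5) / GRID, y = (cy + 0.5) / GRID;
                    int best = 0;
                    for (int i = 1; i < CORES; i++)
                        if (hypot(x - px[i], y - py[i]) <
                            hypot(x - px[best], y - py[best]))
                            best = i;
                    owned[best]++;
                }
            for (int i = 0; i < CORES; i++)
                printf("core %d at (%.2f, %.2f) owns %3d cells\n",
                       i, px[i], py[i], owned[i]);
            return 0;
        }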

    A Multithreaded Runtime Environment with Thread Migration for HPF and C* Data-Parallel Compilers

    This paper studies the benefits of compiling data-parallel languages onto a multithreaded runtime environment that provides a dynamic thread migration facility. Each abstract process is mapped onto a thread, so that dynamic load balancing can be achieved by migrating threads among the processing nodes. We describe and evaluate an implementation of this idea in the Adaptor HPF and UNH C* data-parallel compilers. We show that no deep modifications of the compilers are needed, and that the overhead of managing threads can be kept small. As an experimental validation, we report on an HPF implementation of the Gauss partial pivoting algorithm. We show that an initial BLOCK data distribution combined with our dynamic load balancing scheme can reach the performance of the optimal CYCLIC distribution.
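
    The imbalance that migration corrects can be shown with a toy cost model: in Gaussian elimination the active part of the matrix shrinks over time, so if the work attached to column col is assumed to grow linearly with col, a static BLOCK distribution loads the last node far more heavily than a CYCLIC one. The matrix size, node count, and cost model below are arbitrary assumptions for illustration.

        /* Tally per-node work under BLOCK and CYCLIC distributions,
         * assuming work(col) = col + 1 (later columns cost more). */
        #include <stdio.h>

        #define N     1024   /* matrix columns   */
        #define NODES 4      /* processing nodes */

        int main(void)
        {
            long block[NODES] = {0}, cyclic[NODES] = {0};

            for (int col = 0; col < N; col++) {
                long work = col + 1;
                block[col / (N / NODES)] += work;  /* contiguous chunks */
                cyclic[col % NODES]      += work;  /* round-robin       */
            }
            for (int n = 0; n < NODES; n++)
                printf("node %d: block=%8ld  cyclic=%8ld\n",
                       n, block[n], cyclic[n]);
            return 0;
        }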

    NewMadeleine: Scheduling and Optimization of High-Performance Communication Schemes

    Despite the spectacular progress made by communication interfaces for high-speed networks over the last fifteen years, many potential optimizations still escape communication libraries. This is mainly due to designs focused on trimming the critical path to the extreme in order to minimize latency. In this article, we present a new communication library architecture built around a powerful transfer-optimization engine whose activity is synchronized with that of the network cards. The code of the optimization strategies is generic and portable, and is parameterized at runtime by the capabilities of the underlying network drivers. The database of predefined optimization strategies is easily extensible. Moreover, the scheduler is able to globally mix multiple logical flows over one or several physical cards, potentially of different technologies in a heterogeneous multi-rail configuration.
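
    The core idea of the optimization engine can be sketched as a strategy that runs while the NIC is busy: pending small messages accumulate in a queue, and the strategy packs as many as fit into a single packet before the next transmission. The queue layout, MTU, and function names below are illustrative assumptions, not NewMadeleine's actual interface.

        /* Pack as many queued messages as fit into one MTU-sized packet. */
        #include <stdio.h>
        #include <string.h>

        #define MTU 1024

        struct msg { const char *data; size_t len; };

        static size_t aggregate(struct msg *queue, int n,
                                char *packet, int *consumed)
        {
            size_t used = 0;
            *consumed = 0;
            for (int i = 0; i < n && used + queue[i].len <= MTU; i++) {
                memcpy(packet + used, queue[i].data, queue[i].len);
                used += queue[i].len;
                (*consumed)++;
            }
            return used;
        }

        int main(void)
        {
            struct msg queue[] = { {"tag:1 ", 6}, {"hello ", 6}, {"world", 5} };
            char packet[MTU];
            int consumed;
            size_t len = aggregate(queue, 3, packet, &consumed);
            printf("sent %d messages in one %zu-byte packet\n", consumed, len);
            return 0;
        }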

    Improving Reactivity to I/O Events in Multithreaded Environments Using a Uniform, Scheduler-Centric API

    Reactivity to I/O events is a crucial factor in the performance of modern multithreaded distributed systems. In our scheduler-centric approach, an application detects I/O events by requesting a service from a detection server through a simple, uniform API. We show that a good choice for this detection server is the thread scheduler itself. This approach simplifies application programming, significantly improves performance, and provides much tighter control over reactivity.
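
    The pattern can be sketched with Pthreads: client threads do not poll their file descriptors themselves, but hand them to a single detection server and sleep until it signals readiness (in the paper, that server is the thread scheduler). The single-fd server thread below is an illustrative stand-in, not the paper's API.

        /* A detection-server thread waits on poll(2) and wakes the client
         * thread that registered interest. Build with -pthread. */
        #include <poll.h>
        #include <pthread.h>
        #include <stdio.h>
        #include <unistd.h>

        static struct pollfd watched;      /* one watched fd, for brevity */
        static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
        static int fd_ready = 0;

        static void *detection_server(void *arg)
        {
            (void)arg;
            poll(&watched, 1, -1);         /* wait for the I/O event */
            pthread_mutex_lock(&lock);
            fd_ready = 1;                  /* wake the blocked client */
            pthread_cond_signal(&ready);
            pthread_mutex_unlock(&lock);
            return NULL;
        }

        int main(void)
        {
            int pipefd[2];
            pipe(pipefd);
            watched.fd = pipefd[0];
            watched.events = POLLIN;

            pthread_t srv;
            pthread_create(&srv, NULL, detection_server, NULL);

            write(pipefd[1], "x", 1);      /* trigger the event */

            pthread_mutex_lock(&lock);     /* client sleeps on the server */
            while (!fd_ready)
                pthread_cond_wait(&ready, &lock);
            pthread_mutex_unlock(&lock);
            puts("client woken by the detection server");
            pthread_join(srv, NULL);
            return 0;
        }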
    • 
